
[v22.3.x] archival: remove timeouts from archival_metadata_stm #12690

Merged

Conversation

andrwng
Contributor

@andrwng andrwng commented Aug 9, 2023

Backport of #9633
Fixes #10777

CONFLICTS:

  • cloud stress test doesn't exist in this branch
  • reimplemented the broken semaphore fix, git wasn't able to apply cleanly

At a high level, the archival loop is implemented with the following steps:

```
while not aborted:
  wait until node is leader
  sync that offsets from previous terms are applied
  while node is leader:
    upload segments based on what is in the manifest
    replicate changes to manifest via archival stm
    wait for archival stm update to be applied (updates manifest)
```

The archival loop currently expects that updates to the manifest are driven either by it or by the housekeeping service (ignored here for the sake of example). One case where this invariant breaks is when the replication/apply of archival batches times out: the actual apply of the op may still proceed as part of the raft::state_machine background fiber while the archival loop moves on without waiting, so the updates to the manifest may race with a subsequent iteration of the archival loop.

This commit fixes the race by removing the timeout from the archival_metadata_stm::do_replicate_commands() call, forcing the caller to wait indefinitely for the op to complete. If the wait fails and the persisted_stm is still leader, it explicitly steps down to force the archival loop to re-sync in a new term.
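
For illustration, here is a minimal sketch of that pattern using hypothetical stand-in types (archival_stm_sketch, errc, and the *_sketch helpers are not the actual Redpanda API): replicate the command batch, wait for the apply with no deadline, and step down if the wait still fails while we are leader.

```
// Illustrative sketch only: hypothetical stand-in types, not the actual
// Redpanda code. It shows the shape of the change: replicate the archival
// command batch, wait for the apply with no deadline, and step down from
// leadership if the wait still fails while we are leader.
enum class errc { success, replication_error };

struct replicate_result {
    bool ok;
    long last_offset;
};

struct archival_stm_sketch {
    replicate_result replicate(/* command batch */) { return {true, 42}; }
    // Wait until the background apply fiber has applied `offset`. There is no
    // timeout parameter any more, so the caller blocks until apply (or abort).
    bool wait_applied(long offset) { (void)offset; return true; }
    bool is_leader() const { return true; }
    void step_down() { /* force the archival loop to re-sync in a new term */ }
};

errc do_replicate_commands_sketch(archival_stm_sketch& stm) {
    auto res = stm.replicate();
    if (!res.ok) {
        return errc::replication_error;
    }
    // Previously this wait carried a timeout; on expiry the apply could still
    // race ahead in the background fiber. Now the wait is indefinite.
    if (!stm.wait_applied(res.last_offset)) {
        if (stm.is_leader()) {
            stm.step_down();
        }
        return errc::replication_error;
    }
    return errc::success;
}
```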

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

Bug Fixes

  • Fixes an issue with the archival upload path that could contribute to data loss when archival metadata updates took a long time to be replicated and applied.

jcsp and others added 4 commits August 9, 2023 18:39
The underlying wait already takes one, this is just
a pass-through.

(cherry picked from commit 7be3f80)
There's an upcoming change to remove the timeout from the
archival_metadata_stm. That change is much simpler if
persisted_stm::wait_no_throw() takes a timeout (deadline) instead of a
duration, since the underlying raft::state_machine::wait() call uses the
model::no_timeout deadline value to determine whether to set a timer.

Along the way I added an optional abort source reference that is
currently unused.

(cherry picked from commit cac4ac9)
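
As a rough sketch of why the deadline-based signature helps (hypothetical helper names, not the actual persisted_stm/raft API): a deadline with a "no timeout" sentinel lets the callee decide whether to arm a timer at all, whereas a duration forces it to pick one.

```
// Minimal sketch, not the actual persisted_stm/raft code: a deadline
// (time_point) parameter with a "no timeout" sentinel lets the callee decide
// whether to arm a timer at all.
#include <chrono>

using sketch_clock = std::chrono::steady_clock;
inline constexpr auto no_timeout_sketch = sketch_clock::time_point::max();

// Hypothetical wait helper: takes a deadline instead of a duration.
bool wait_no_throw_sketch(long offset, sketch_clock::time_point deadline) {
    (void)offset;
    if (deadline == no_timeout_sketch) {
        // Sentinel deadline: skip the timer and block until applied/aborted.
        return true;
    }
    // Otherwise arm a timer that fires at `deadline`.
    return sketch_clock::now() < deadline;
}

// Callers that previously passed a duration convert once at the call site:
//   wait_no_throw_sketch(offset, sketch_clock::now() + some_duration);
```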
At a high level, the archival loop is implemented with the following
steps:

```
while not aborted:
  wait until node is leader
  sync that offsets from previous terms are applied
  while node is leader:
    upload segments based on what is in the manifest
    replicate changes to manifest via archival stm
    wait for archival stm update to be applied (updates manifest)
```

The archival loop currently expects that updates to the manifest are
driven either by it or by the housekeeping service (ignored here for
the sake of example).

One case where this invariant breaks is when the replication/apply of
archival batches times out: the actual apply of the op may still proceed
as part of the raft::state_machine background fiber while the archival
loop moves on without waiting, so the updates to the manifest may race
with a subsequent iteration of the archival loop.

This commit fixes the race by removing the timeout from the
archival_metadata_stm::do_replicate_commands() call, forcing the caller
to wait indefinitely for the op to complete.

(cherry picked from commit 0dd1dc8)
CONFLICT:
- ntp_archiver_service has changed a bunch since 22.3, but the effective
  change is the same here: catch broken semaphore exceptions

Replication may throw a ss::broken_named_semaphore when it encounters a
shutdown error:

```
DEBUG 2023-03-24 22:54:01,854 [shard 0] archival - [fiber4 kafka/panda-topic/7] - ntp_archiver_service.cc:494 - Updating the archival_meta_stm in read-replica mode, in-sync offset: -9223372036854775808, last uploaded offset: 0, last compacted offset: -9223372036854775808
DEBUG 2023-03-24 22:54:01,854 [shard 0] cluster - ntp: {kafka/panda-topic/7} - archival_metadata_stm.cc:188 - command_batch_builder::replicate called
WARN  2023-03-24 22:54:01,854 [shard 0] cluster - ntp: {kafka/panda-topic/7} - archival_metadata_stm.cc:333 - error on replicating remote segment metadata: raft::errc:19
ERROR 2023-03-24 22:54:01,854 [shard 0] archival - [fiber4 kafka/panda-topic/7] - ntp_archiver_service.cc:252 - upload loop error: seastar::broken_named_semaphore (Semaphore broken: mutex)
```

This commit handles the exception the same way we handle other shutdown exceptions:
by doing nothing.

(cherry picked from commit cdb2b39)
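
For illustration, a minimal sketch of that handling with hypothetical exception types (not the actual Seastar/Redpanda code, which catches seastar::broken_named_semaphore in the ntp_archiver upload loop): on shutdown, swallow the error and let the loop exit instead of logging it at ERROR level.

```
// Illustrative sketch with hypothetical exception types.
#include <exception>
#include <iostream>

struct broken_named_semaphore_sketch : std::exception {};
struct gate_closed_sketch : std::exception {};

void upload_iteration_sketch() { throw broken_named_semaphore_sketch{}; }

void upload_loop_sketch() {
    try {
        upload_iteration_sketch();
    } catch (const broken_named_semaphore_sketch&) {
        // Shutdown in progress: treated like other shutdown exceptions,
        // i.e. do nothing and fall out of the loop.
    } catch (const gate_closed_sketch&) {
        // Same handling for other shutdown-time errors.
    } catch (const std::exception& e) {
        std::cerr << "upload loop error: " << e.what() << "\n";
    }
}
```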
@piyushredpanda
Contributor

Only failure seems to be #10140

@piyushredpanda piyushredpanda merged commit 609c6be into redpanda-data:v22.3.x Aug 10, 2023
8 checks passed
@BenPope BenPope added this to the v22.3.23 milestone Aug 17, 2023
@BenPope BenPope added the kind/backport PRs targeting a stable branch label Aug 17, 2023
Labels: area/redpanda, kind/backport